


o 



Optimal Discrete Uniform Generation from 
Coin Flips, and Applications 



^^ Jeremie Lumbroso 

O 

(N April 9, 2013 



Abstract 



This article introduces an algorithm to draw random discrete uniform va- 
riables within a given range of size n from a source of random bits. The al- 
gorithm aims to be simple to implement and optimal both with regards to the 
amount of random bits consumed, and from a computational perspective — 
c/3 allowing for faster and more efficient Monte-Carlo simulations in computa- 

I ^1 tional physics and biology. I also provide a detailed analysis of the number 

of bits that are spent per variate, and offer some extensions and applications, 
'""' in particular to the optimal random generation of permutations. 

_~^ Now that simulations can be run extremely fast, they are routinely able to consume 

Qs^ over a billion random variates an hour — and several orders of magnitude more 

^""^ throughout an execution. At such a point, the amount of pseudo-randomness at our 

■^ disposal may eventually become a real issue, and it is pertinent to devise techniques 

that are economical with respect to the amount of randomness consumed, while 
remaining as or more efficient than existing techniques with regards to speed and 
space usage. 



}_j The random-bit model. Much research has gone into simulating probability dis- 

^ tributions, with most algorithms designed using infinitely precise continuous uni- 

form random variables (see fl , II.3.7]). But because (pseudo-)randomness on com- 
puters is typically provided as 32-bit integers — and even bypassing issues of true 
randomness and bias — this model is questionable. Indeed as these integers have a 
fixed precision, two questions arise: when are they not precise enough? when are 
they too precise? These are questions which are usually ignored in typical fixed- 
precision implementations of the aforementioned algorithms. And it suggests the 
usefulness of a model where the unit of randomness is not the uniform random 
variable, but the random hit. 



This random bit model was first suggested by Von Neumann 111811 . who humo- 
rously objected to the use of fixed-precision pseudo-random uniform variates in 
conjunction with transcendant functions approximated by truncated serie^ His 
remarks and algorithms spurred a fruitful line of theoretical research seeking to de- 
termine which probabilities can be simulated using only random bits (unbiased or 
biased? with known or unknown bias?), with which complexity (expected number 
of bits used?), and which guarantees (finite or infinite algorithms? exponential or 
heavy-tailed time distribution?). Within the context of this article, we will focus on 
designing practical algorithms using unbiased random bits. 

In 1976, Knuth and Yao JH provided a rigorous theoretical framework, which 
described generic optimal algorithms able to simulate any distribution. These al- 
gorithms were generally not practically usable: their description was made as an 
infinite tree — infinite not only in the sense that the algorithm terminates with pro- 
bability 1 (an unavoidable fact for any probability that does not have a finite binary 
expansion), but also in the sense that the description of the tree is infinite and 
requires an infinite precision arithmetic to calculate the binary expansion of the 
probabilities. 

In 1997, Han and Hoshi provided the interval algorithm, which can be seen 
as both a generalization and implementation of Knuth and Yao's model. Using a 
random bit stream, this algorithm amounts to simulating a probability p by doing 
a binary search in the unit interval: splitting the main interval into two equal sub- 
intervals and recurse into the subinterval which contains p. This approach natu- 
rally extends to splitting the interval in more than two subintervals, not necessarily 
equal. Unlike Knuth and Yao's model, the interval algorithm is a concrete algo- 
rithm which can be readily programmed... as long as you have access to arbitrary 
precision arithmetic (since the interval can be split to arbitrarily small sizes). 

In 2003, Uyematsu and Li flTl gave implementations of the interval algorithm 
which use a fixed precision integer arithmetic, but these algorithms approximate 
the distributions (with an accuracy that is exponentially better as the size of the 
words with which the algorithm gets to work is increased) even in simple cases, 
such as the discrete uniform distribution which they use as an illustrating example 
using n = 3. 

I was introduced to this problematic through the work of Flajolet, Pelletier and 
Soria JSl on Bujfon machines, which are a framework of probabilistic algorithms 
allowing to simulate a wide range of probabilities using only a source of random 
bits. 



'Or in his words, as related by Forsythe: "/ have a feeling, however, that it is somehow silly to 
take a random number and put it elaborately into a power series^ 



Discrete uniform distribution. Beyond these generic approaches, there has been 
much interest specifically in the design of efficient algorithms to sample from the 
discrete uniform distribution. While it is the building brick to many other more 
complicated algorithms, it is also notable for being extremely common in various 
types of simulations in physics, chemistry, etc. 

Wu et al. EOll were among the first to concretely consider the goal of saving 
bits as much as possibly: they note that in practice small range uniform variables 
are often used, and thus slice a 32-bit pseudo-random integer into many smaller 
integers which they then reject. Although they do this using a complicated scheme 
of Boolean functions, the advantages are presumably that the operations being on 
a purely bitwise level, they may be done by hardware implementations and in pa- 
rallel. 

Orlov lfT4l gives an algorithm which reduces the amount of rejection per call, 
and assembles a nice algorithm using efficient bit-level tricks to avoid costly di- 
visions/modulo as much as possible; yet his algorithm still consumes randomness 
as 32-bit integers, and is wasteful for small values (which are typically the most 
common in simulations). 

Finally Ladd ifTOl considers the problem of drawing from specific (concentra- 
ted) distributions for statistical physics simulations; he does this by first defining a 
lookup table, and then uniformly drawing indexes in this table. His paper is notable 
to us because it is written with the intent of making as efficient a use of random bits 
as possible, and because he provides concrete implementations of his algorithms. 
However, as he draw discrete uniform variables simply by truncating 32-bit inte- 
gers, his issue remains the same: unless his lookup table has a size which is a power 
of two, he must contend with costly rejection which increases running time, in his 
simulations, more than fourfold (see his compared results for a distribution with 8 
states, and with 6 states). 

Our contribution. Our main algorithm allows for the exact sampling of discrete 
uniform variables using an optimal number of random bits for any range n. 

It is an extremely efficient implementation of Knuth and Yao's general frame- 
work for the special case of the discrete uniform distribution: conceptually simple, 
requiring only 2 log n bits of storage to draw a random discrete uniform variable 
of range n, it is also practically efficient to the extent that it generally improves 
or matches previous approaches. A full implementation in CIC++ is provided as 
illustration at the end of the article, in Appendix [Aj 

Using the Mellin transform we precisely quantify the expected number of bits 
that are used, and exhibit the small fluctuations inherent in the base conversion 
problem. As expected, the average number of bits used is slightly less good than 



the information-theoretic optimality — drawing a discrete uniform variable comes 
with a small toll — and so we show how using a simple (known) encoding scheme 
we can quickly reach this information-theoretic optimality. Finally, using a similar 
method, we provide likewise optimal sampling of random permutations. 

1 The Fast Dice Roller algorithm 

The Fast Dice Roller algorithm, hereafter abbreviated FDR, is very simple, 
and can be easily implemented in a variety of languages (taking care to use the shif- 
ting operation to implement multiplication by 2). It takes as input a fixed integer 
value of n, and returns as output a uniformly chosen integer from {0,...,n — 1}. 
The flipO instruction does an unbiased coin flip, that is it returns or 1 with equi- 
probability. Both this instruction (as a buffer for a PRNG which generates 32-bit 
integers) and the full algorithm are implemented in C/C-i~i- at the end of the article, 
in Appendix [A] 

Theorem 1. The Fast Dice Roller algorithm described below returns an in- 
teger which is uniformly drawn from the set {0, . . . , n — 1} and terminates with 
probability 1. 
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function FASTDlCEROLLER(n) 
f ^ 1; c ^ 
loop 

f ^ 2f 
c ^ 2c + flipO 
iiv ^ n then 
if c < n then 
return c 
else 

V ■^ V — n 
c ■^ c — n 
end if 
end if 
end loop 
end function 



Proof. Consider this statement, which is a loop invariant: c is uniformly distributed 
over {0,...,f — 1}. Indeed, it is trivially true at initialization, and: 



• in lines 4 and 5, the range v is doubled; but c is doubled as well and added a 
parity bit to be uniform within the enlarged new range; 

• lines 10 and 1 1 are reached only if c ^ n; conditioned on this, c is thus uni- 
form within {n, . . . ,v — 1}, with this range containing at least one integer 
since we also have f > c ^ n; as such we are simply shifting this range 
when substracting n from both c and v. 

The correctness of the algorithm follows from this loop invariant. As c is always 
uniformly distributed, when the algorithm returns c upon verifying that v ^ n (the 
current range of c is at least n) and c < n (the actual value of c is within the range 
we are interested in), it returns a uniform integers uniform in{0,...,n — 1}. 

The termination and exponential tails can be proved by showing that an equi- 
valent, less efficient algorithm is geometrically distributed: in this equivalent al- 
gorithm, instead of taking care to recycle random bits when the condition on line 
7 fails, we simply restart the algorithm; by doing so, we have an algorithm that 
has probability p = n/2L'°S2 "J+i > 1/2 of termination every [log2 nj + 1 itera- 
tions. D 

The space complexity is straightforward: by construction c < v and v < 2n 
are always true; thus c and v each require 1 + log2 n bits. The time complexity, 
which also happens to be the random bit complexity, since exactly one random bit 
is used per iteration, is more complicated and detailed in the following section. 

Remark. Most random generation packages do not come with a flip or random boolean 
operation (and those which do provide such commodity usually do so in a grossly inef- 
ficient way). Thus a concrete way of consuming random bits is to hold 32-bit random 
numbers in a temporary buffer variable and use each bit one after the other. What is surpri- 
sing is that the overhead this introduces is more than compensated by the savings it brings 
in random bits — which are costly to generate. 

Remark. It should be noted that this algorithm can be straightforwardly adapted to the 
problem of simulating a BernoulU law of rational parameter p, as illustrated in AppendixJB] 

2 Analysis of the expected cost in random bits 

Theorem 2. The expected number Un of random bits needed to randomly draw a 
uniform integer from a range of length n using the FDR algorithm is asymptoti- 
cally 

1 1 7 

Un = log2 n+- + - -^ + P(log2n) + 0{n-'^) 

I log 2 log 2 
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Figure 1. Plot of the expected cost of generating a random discrete uniform va- 
riable, where the a;-axis is the (discrete) range n of the variable: the red curve 
is computed from the exact sum, the blue curve is computed from the asymp- 
totic expression of Theorem |2] (using only a dozen roots in the trigonometric 
polynomial P). 



for any a > 0, and where P is a periodic function, a trigonometrical polynomial 



defined in Equation (11). 



Remark. Furthermore, the FDR algorithm terminates with probability 1, and as proven in 
the previous section, the distribution of its running time (and number of random bits) has 
exponential tails^ 

The remainder of this section is dedicated to proving this theorem. First, we will 
revisit some of Knuth and Yao's main results, and by showing the FDR algorithm is 
an implementation of their theoretical framework to the special case of the discrete 
uniform distribution, we obtain an expression of its expected cost in random bits 
as an infinite sum. Then, we use a classical tool from analysis of algorithms, the 
Mellin transform, to obtain the sharp asymptotic estimate stated in the theorem. 



2.1 Knuth and Yao's optimal DDG-trees 

As mentioned, Knuth and Yao introduce a class of algorithms, called DDG-trees, 
to optimally simulate discrete distributions using only random bits as a randomness 
source. 



"A random variable X is said to have exponential tails if there are constants C and p < 1 such 

thatP[X > fc] < Cp''. 



The discrete distributions are defined by a probability vector p = (pi , . . . , p„), 
which is possibly infinite. A DDG-tree is a tree (also possibly infinite) where: 
internal nodes indicate coin flips; external nodes indicate outcomes; at depth k 
there is an external node labeled with outcome i if and only if the A;-th bit in the 
binary expansion oipi is 1. 

Knuth and Yao do not provide an efficient way to build, or simulate, these 
DDG-trees — and this is far from a trivial matter. But one of their main results [8l 
Theorem 2.1] is that DDG-trees provide simulations which are optimal in number 
of random bits used, with an average complexity which, if finite, is 

l^{p) = U{pi) + . . . + U{pn) (1) 

where i^ is a function defined by 

fc=0 

where x i— ;■ {x} denotes the fractional part function. A straightforward conse- 
quence is that the optimal average random-bit complexity to simulate the discrete 
uniform distribution, that is where p = {l/n, . . . , 1/n), is 

2.2 The FDR algorithm, as an implementation of a DDG-tree 

This sum is the exact complexity of the algorithm presented in Section [TJ this is 
best understood by noticing that the FDR algorithm is an implementation of a 
DDG-tree. The algorithm efficiently computes the binary expansion of 1/n: every 
iteration computes a single bit, and those iteration where the condition line 6 is 
verified are those where this bit is equal to 1, and where, according to Knuth and 
Yao's framework, the DDG-tree should have n terminal leaves. The variable c 
simulates a path within the tree, from the root to one of the leaves. 

2.3 The Mellin transform of a base function 

The Mellin transform is a technique to obtain the asymptotics of some types of 
oscillating functions. It is of central importance to the analysis of algorithms be- 
cause such oscillating functions appear naturally (for instance, in most all analyses 
having to do with divide-and-conquer type recursions for instance), and their per- 
iodic behavior can generally not be quantified with other usual, and less precise 
asymptotic techniques 
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Figure 2. Tree of the branching process of the FDR algorithm for the case where 
71 = 5; a single process is a path from the root to a leaf, where each internal node 
corresponds to a new random bit being used, and the leaves correspond to the 
outcome. This tree is exactly a DDG-tree for simulating the uniform distribution 
of rang n — 5. Indeed, note that the binary expansion of 1/5 is periodic and 
goes 1/5 = 0.001100110011 . . .2, which corresponds with the altemance of 
levels with and without leaves in the corresponding tree. 

Definition 1. Let / be a locally Lebesgue-integrable function over (0, +cxd). The 
Mellin transform of / is defined by as the complex function, s E C, 

/•+00 

M[fix),x,s] = ris)= fix)x'-'dx. 

Jo 

The largest open strip a < Re(s) < /3 in which the integral converges is called the 
fundamental strip. We may note this strip <a, f3>. 

Important properties. Let / be a real function; the following F is called a har- 
monic sum as it represents the linear superposition of "harmonics" of the base 
function /, 

F{x) = Y, ^kfif^kx) (4) 

k 

and the A^ are called the amplitudes, the fi^ the frequencies Pl . Most functions 
like F usually involve subtle fluctuations which preclude the use of real asymptotic 
techniques. 

The first important property we will make use of is that the Mellin transform 
allows us to separate the behavior of the base function from that of its harmonics. 

8 



Indeed, if f*{s) is the Mellin transform of /, the MeUin transform of the harmonic 
sum F involving / is then simply 

The second property that is central to the analyses in this chapter is that the beha- 
vior of a function / in and in +00 can be directly respectively read on the poles 
to the left or right of the Mellin transform /*. 

2.4 Mellin transform of the fractional part 

Before proceeding to the Mellin transform of the harmonic sum which we are in- 
terested in, using the principles described by Flajolet et al. |4], we must manually 
calculate the Mellin transform of the fractional part function using classical in- 
tegration tools — thus giving another proof of a result which seems to be due to 
Titchmarsh OH §2]. 

Lemma 1. Let f{x) = {1/x} be the fractional part of the inverse function, its 
Mellin transform, valid for < Re(s) < 1, is 

ns)= ^^'^ 



s 

Proof. From the observation that, for x > 1, f{x) = 1/x, we may split the inte- 
gral, 

x^~^dx = < - > x^'^Mx + / x'^~'^dx. 



'0 1 2; J Jo l^ ) Ji 

To integrate on the unit interval, we spUt according to the inverses of integers. 






n 

1 

n + l 



1 

n -. 



and furthermore 



-.00^ ^00 -. ^00 ^ 



Vn/" x^-Mx = --V-^ + -V -V 

Z^ / 1 s Z^ ^s-1 sZ^f^ + i)s-i s /L^ 



n=\ n+l n=l n=\ ^ ' n=\ ^ ' 

Each sum can be replaced by properly shifted Riemann's zeta function, 

00 „JL 



> n / x^ dx 

n=l n+l 



C(^) 



By analytic continuation, the two integrals of x*^^ are valid even outside of their 
initial domain of definition. They cancel each other out, and we are left with 



oo 



Finally, the fundamental strip in which this Mellin transform is defined can be 
found by observing that 

lim|-| = 0(1) = 0(xO) and Vx > 1, |-| = - = O(x-i). 

D 

Remark. Observe that we calculate the Mellin transform of {l/a:}, but this also provides 
the Mellin transform of {x}. Indeed this follows the following functional property 

M[f{x),x,s] = r{s) ^ M[f{l/x),x,s] = -r{-s) 

respectively on the fundamental strips <a, ji> and <— /3, ~a>, which is a special case of a 
more general rule expressing the Mellin transform f{x^), see for instance |4, Theorem 1]. 

2.5 Mellin transform of the discrete uniform average complexity 

We now have all the tools to study the harmonic sum we are interested, 

^ r 2M 1 

fc=0 "^ ■' 

Our first step is to transform this into a real function (replace the discrete variable n 
by a real variable x) and decompose this as a base function and harmonics, 



F(x):=xf;|||l=x5]/(2-'^x)2-'= with /(x) = |H 



We now use: the functional property we have recalled in Equation Q; the additio- 
nal property 101 Fig. 1] the Mellin transform of xf{x) is 

M[xf{x),x,s] = r{x + 1) (8) 

on the shifted fundamental strip <a — l,/3— 1>; and the Mellin transform of the 
fractional part, as stated in Lemma[T] With these, we finally obtain that 

F*(s) = ^(" + ^^ (9) 

^'' (l-2^)(s + l)- ^^' 
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This Mellin transform is defined on the fundamental strip — 1 < Re(s) < 0, and 
it has one double pole in s = (from the C(s + 1) and the denominator 1 — 2^^ 
which cancels out) which will induce a logarithmic factor, and an infinity of simple 
complex poles in s = likii j log 2 from which will come the fluctuations. 

Indeed, using the Mellin summation formula H p. 27], we obtain the asymp- 
totic expansion 






(10) 



where fi is the set of poles to the right of the fundamental strip, which we have just 
enumerated. Thus we get 



F x ~log22;+- + — — 

2 log 2 log 2 
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+ P(log2x) + 0{n~'^), 



for any arbitrary a > 0, and with P a trigonometric polynomial defined as: 
1 



P(log2a;) 



,-^ C(2ifc7r/log2 + l) . „., , . ,.., 

2^ ow.,/i„„o , . exp(-2ifc7rlog2x). (11) 



log 2 ^^ 2iA;7r/ log 2 + 1 

^ fc6Z\{0} ' ^ 



3 Algorithmic tricks to attain entropic optimality 

In the expression of the average random-bit cost, we can distinguish two parts, 

log2 n and in = - + , — - - ^ + i^(log2n) + 0(n~'^). 

2 log 2 log 2 

On one side, the expected log2n contribution that comes from "encoding" n in a 
binary base (using random bits); on the other, some additional toll when n is not a 
dyadic rational, i.e. n ^ 2^, and the generation process requires rejection. 




1024 



Figure 3. Plot of the toll t„ in the average random-bit cost of generating a 
discrete uniform variable, with n from 2 to 1024; as expected the plot exhibits a 
distinct logarithmic period. When \jn has a finite binary expansion, i.e. it is a 
dyadic rational with n — 2~^, the toll is equal to zero. 
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In their article, Knuth and Yao JH Theorem 2.2] prove that for all discrete distribu- 
tions (including the one we are interested in), this toll has the following bounds: 

^ t„ ^ 2. (12) 

Because this toll, as exhibited in Figure |3] is not monotonous and upperboun- 
ded by a constant, the implication is that it is generally more efficient (in terms of 
the proportion of bits which are wasted in the toll) to generate a discrete uniform 
of larger range, because this toll becomes of insignificant magnitude compared to 
the main logarithmic factor. 

In the remainder of this section, we use this observation to go beyond theoretic 
bounds and reach the entropic optimality for the generation of discrete uniform 
variables, and random permutations. 

3.1 Batch generation 

As it has been observed many times, for instance by Han and Hoshi 17] V.], it can 
prove more efficient to generate the Cartesian product of several uniform variables, 
than generating a single uniform variable — especially when the considered range 
is small. 

Thus, instead of generating a single discrete uniform variable of range n, we 
generate j variables at a time by drawing a discrete uniform variable Y of range n^ 
and we use its decomposition in n-ary base 

Y ■.= Xj-n^-^ + ... + Xi-n^ (13) 

to recover the Xi through a simple (albeit slightly costly) succession of integer 
divisions. As it turns out, this trick decreases the toll by a more than linear factor, 
as encapsulated by the following theorem. 

Theorem 3. The number Un,j of random bits needed to randomly draw a uniform 
integer from a range of length n increases when the random integers are drawn j 
at a time, 

un, = log, - + J (^ + i^ - i^) + jPi^og^n) + Oin-'^/j) 

for some any a > 0, so that as j tends to infinity, we reach asymptotical informa- 
tion theoretic optimality 

Un,oo ~ log2 n. 
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Figure 4. This plot illustrates the quadratic decrease of the periodic oscillations, 
and fast convergence of the logarithm, when generating several random discrete 
uniform variables at a time. In dark blue, the expected cost of generating one 
uniform variable; in light blue, the expected cost when generating six at a time 
(i.e., j — 6); as a point of comparison, in red, the information-theoretic optimal 
given by the binary logarithm. 



In practice, because of the quadratic rate by which this trick attenuates the impor- 
tance of the oscillations, it doesn't take much to get very close to the information 
theoretic optimum of log2 n bits. As illustrated by Figure |4j typically taking j = Q 
is already very good, and a significant improvement over j = 1. 

Larger values of j should be disregarded as, considering the size of words is 
finite (32, 64, or 128 bits, typically, depending on the computer architecture), it is 
of course important to keep in mind that j must be chosen so as to not cause an 
overflow. 

Proof. Using the function F{x) as defined in the proof of Theorem |2l we define 
F{x) = G{x^)/j. Classical rules of the Mellin transform show that 



valid in the fundamental strip where —j < Re(s) < 0. We now can define tnj the 
unitary toll for each discrete uniform variable of range n, when they are generated 
j at a time, as 

V r^ - A^] + ^Pilog^n) + 0(n-"/j). 



'-ji, 



2 log 2 log 2 
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Using Knuth and Yao's bound restated in Equation (H) , we have the rough bound 

2 

J 



^ tn,j ^ 



which is sufficient to show that as j tends to infinity, t„ j tends to zero. D 

3.2 Optimally generating random permutations 

Beside (both continuous and discrete) uniform variables, another elementary buil- 
ding block in random generation is random permutations. They are routinely used 
in a great many deal of applicationqj 

For example, on a related topic, one such application is to the automatic combi- 
natorial random generation methods formalized by Flajolet and his collaborators. 
In both the recursive method E §3] and Boltzmann sampling ||3l §4], only the 
shapes of labeled objects are sampled: the labeling can then be added after, by 
drawing a random permutation of the size of the object. 

But random permutations are also useful in other fields. In statistical physics HI 
§1.2.2], for instance in quantum physic simulations. 

It should be noted that the algorithmic ideas presented in this subsection are 
classical [ 1 , §XIII]. Their relevance in the context of this article is that, in conjunc- 
tion with the FDR algorithm, they allow to concretely attain previously theoretical- 
only optimal lower bounds — while for the most part remaining reasonably efficient. 

Asymptotical optimality using the Fisher- Yates shuffle A straightforward idea 
with the elements thus far presented, is as follows. Assuming one generates the 
uniform in batches with j sufficient so that we may assume that each uniform 
of range N takes log2 N + e bits, with some very small e, then the random bit 
complexity C of generating a permutation with the Fisher- Yates shuffle is 



C ~ log2 2 + log2 3 + . . .2 + log2 n + en = log2 n! + en (14) 

But unfortunately, even though by generating the separate uniform variables 
in large enough batches, we can decrease the toll considerably for each uniform 
variable, we will still have an overall linear amount of such toll when considering 



^And even so, their generation may still prove challenging, as recently evidenced by Microsoft. 
Indeed, as a concession to the European Union (which found Microsoft guilty of imposing its own 
browser to Windows users, to the detriment of the competition), Microsoft provided Windows 7 
users with a randomly permutated ballot screen for users to select a browser. But a programming 
error made the ordering far from "random" (uniform), which briefly caused a scandal. 
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function SHUFFLE(r) 

n^ \T\ 

for i = 1 to n do 

k <— i + DlSCRETEUNIFORM(n-i + l) 

SwAP(r, i, k) 
end for 
end function 

Figure 5. The Fisher- Yates random shuffle as described by Durstenfeld. The ar- 
ray T is indexed in one, and DiscreteUniform concordingly returns a random 
uniform number in the range to n — i + 1 included. 

the n variables that must be drawn. Furthermore it is not very practical to have to 
generate uniform in batches (this supposes that we are drawing many permutations 
at the same time). So we suggest another solution. 

Optimal efficiency using a succinct encoding. As early on as 1 888, Laisant lITll 
exhibited a more direct way of generating random permutations, using a mixed 
radix decomposition called factorial base system. 

Lemma 2. Let U be a uniformly drawn integer from {0, . . . , n! — 1}, and let Xn 
be the sequence such that 

C/ = X„-(n- l)! + ... + Xi-0! and Vi,O^Xi<? 

then the Xi are independent uniformly drawn integers from {0, . . . , i — 1}. 

Laisant observed that a permutation could then be constructed by taking the 
sorted list of elements, and taking the X„-th as first element of the permutation, 
then the X„_i-th of the remaining elements as second element of the permutation, 
and so oqj What is remarkable is that using this construction, it is possible to 
directly compute the number of inversions of the resulting permutation by summing 
all Xi, where inversions /(cr) of a permutation a are defined as 

I{a) := {{i,j) G N^ I i < j and ai > ctj} . (15) 

Unfortunately the algorithm requires the use of a chained list instead of an array, 
and thus has quadratic time complexity — which is prohibitivejj 

"'This idea is often associated with Lehmer who rediscovered it 1 12|. 

^Nevertheless it is notable that summing the Xi (without needing to compute the actual per- 
mutation) yields an interesting way of generating a random variable distributed as the number of 
inversions in a permutation of size n. 
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We can use a different bijection of integers with permutations which can be 
computed in linear time, by simply using the Xi as input for the previously descri- 
bed Fisher- Yates shuffle. In this way, we optimally generate a random permutation 
from a discrete uniform variate of range n! — 1, and show how to attain information 
theoretic optimality in a much less contrived way than described in the previous 
subsection. 

A caveat though is that word size may become a real issue: with 32-bit registers 
one can obtain permutations up to n = 12; with 64-bit, up to n = 21; with 128-bit 
up to n = 33. 

Remark. The general idea of numbering or indexing (i.e., establishing a bijection with 
a continuous range of integers containing zero) all objects of a combinatorial class of a 
given size is often called ranking and the inverse transformation — obtaining an object from 
its rank — is called unranking, but has also been referred to as the decoding method 1 
§XIII.1.2]. 

For a long time, devising such unranking schemes often relied on luck or combinatorial 
acumen. Martinez and Molinero 1131 eventually established a general approach by adap- 
ting the previously mentioned recursive random generation method of Flajolet et al. ||6|. 
While this approach is not necessarily efficient, it provides a usable algorithm to attain 
random-bit optimality for the random generation of many combinatorial structures. 

4 Conclusion 

It would have been conceivable that this article yield a theoretical algorithm of 
which the sole virtue would have been to provide a theoretical optimal complexity, 
while proving less than useful for practical use. 

But unexpectedly, it turns out that the extra buffering inherent in consuming 
randomness random-bit-by-random-bijj although time consuming, is more than 
compensated by the increased efficiency in using random bits compared with most 
common methods. 

It remains to be seen whether this is still the case on newer CPUs which contain 
embedded instructions for hardware pseudo random generation. However there are 
arguments that support this: first, assuming that hardware pseudo random genera- 
tion is to eventually become widespread enough for software to take advantage of 
it, it seems likely to take a significant time to be common; second, the computer 
architecture shift seems to be towards RISC architectures which are not burdened 
with such complex instructions. 



*The implementation of the flip() function. It involves: drawing a random 32-bit int, storing it in 
a temporary variable, and then extracting each bit as needed, while making sure to refill the variable 
once all 32 bits have been used. 
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Prospective future work. The result presented here interestingly yields, as a di- 
rect consequence, the expected cost of the alias method, a popular method to simu- 
late discrete distributions which are known explicitly as a histogram, also known as 
sampling with replacement. This method is often said to have constant time com- 
plexity, but that is under the model where discrete uniform variables are generated 
in constant time. 

There are many different applications which are still to be examined: seve- 
ral classical algorithms, which use discrete (and continuous) uniform random va- 
riables, where the random bit cost is as of yet unknown. 

Of particular interest, the process known as sampling without replacement, or 
sampling from a discrete distribution which evolves in a fairly predictable manner. 
The most promising algorithms for this problem follow the work of Wong and 
Easton |[T9]| . which uses a partial sum tree. It remains to be seen what is the overall 
bit complexity of this algorithm, and whether it can be improved (for instance by 
choosing a specific type of tree). 
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A Implementation of the main FDR algorithm 

This proposed implementation makes use of two non-standard packages: the Boost 
library's standard integer definition, to make sure that the buffer integer variable has 
the correct size; and the Mersenne Twister algorithm for the random generation of 
the 32-bit integers themselves (the code for this is wildly available). Note that 
using the correct integer type on a given machine will do; as will using another 
random integer generator than MT algorithm. 

tinclude <cstdlib> 

#include <boost/cstdint . hpp> // Fixed size integers 

tinclude "mtl9937ar . hpp" // Mersenne Twister 

using namespace std; 

// For benchmarking purposes only 
static uint64_t flip_count = 0; 

// Flip buffering variables 
static uint32_t flip_word = 0; 
static int flip_pos = 0; 

int flip (void) 
{ 

if (flip_pos ==0) { 
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flip_word = genrand_int32 ( ) ; 
flip_pos = 32; 



f lip_count++; 
f lip_pos — ; 

return (flip_word & (1 << flip_pos)) >> flip_pos; 
} 



inline uint32_t algFDR (unsigned int n) 

{ 

uint32_t V = 1, c = 0; 

while (true) 

{ 

V = V << 1; 

c = (c << 1) + flipO ; 

if(v >= n) 

{ 

if (c < n) return c; 

else 

{ 

V = V - n; 
c = c - n; 
} 
} 
} 
} 

B Simulating rational Bernoulli variables 

There is a well-known idea in random generation [2. XV. 1.2], to efficiently draw 
a random Bernoulli variable of parameter p: draw a geometric random variable of 
parameter 1/2, k G Geo(l/2); then return, as result of the Bernoulli trial, the k-th 
bit in the dyadic representation of p. 

Interestingly, this idea was already known to physicists, as evidenced by an 
early paper by Pierre et al. fT5], but seems not to be commonly used today in 
Monte-Carlo implementations. Internal simulations show that for typical Boltz- 
mann energy simulations drawing Bernoulli variables in this way consumes 16 
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times fewer random bits, and that simulations are accelerated by a 4 to 6 factor 
(this is less impressive than the number of saved bits because of the accounting 
overhead required to buffer 32-bit integers into single flips). 

The limitation of this algorithm is that obtaining the dyadic representation of 
any p is not a trivial matter. Fortunately for rational numbers it is simple enough, 
and although this is not a new contribution, for the sake of completeness we illus- 
trate it in Figure l6] 



function BlNARYBASE(fc/n) 


function BERNOULLl(A;/n) 


V <— k 


V -^ A; 


loop 


repeat 


V ■(^2v 


V ■(^ 2v 


ifv^n then 


iSv^n then 


V -(^ V — n 


u -^ f — n 


output 1 


b^ 1 


else 


else 


output 


b^O 


end if 


end if 


end loop 


until flipO = 1 


end function 


return b 




end function 



Figure 6. A simple algorithm to output the binary decomposition of a rational 
k/n, k < n, and the corresponding algorithm that simulates a Bernoulli distribu- 
tion of parameter p = k/n. Neither algorithm is dependent on the required pre- 
cision of the binary expansion. They both use only 1 + log2 n bits of space, and 
require only one shift, one substraction and one comparison per iteration. The 
Bernoulli simulation algorithm consumes on average two flips (random bits). 



Remark. With Knuth and Yao's theorem, this algorithm can be shown to be optimal: 
indeed, it simply requires drawing a geometric variable of parameter 1/2, which takes on 
average 2 bits. Coincidentally, that is the optimal cost of drawing any Bernoulli variable. 
Recall the u function defined as. 



fe=0 



•ID 



From the results recalled in Subsection |2.1| we have that the optimal average cost of dra- 
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wing a random Bernoulli variable of parameter p is, 

z.(p) + //(I - p) = ^ ^^ + _^ -^ 

_ - {2'-p} + {2^(l-p)} _ - 1 _^ 

fe=0 k=0 

hence the optimality. 
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